An accurate home price prediction algorithm can reduce volatility in the housing market and take into account existing factors that may not be reflected in a home’s previous selling prices (e.g., a new roof or a new shopping center nearby). However, predictive algorithms can also be exceedingly difficult to perfect. A falsely high average estimate in a neighborhood might lead home sellers to list their homes at too high an asking price, dragging out the sale and thereby introducing friction into the housing market. A falsely low estimate may depress the value of what is oftentimes a homeowner’s most valuable asset.
This project attempts to predict housing prices in metropolitan Miami by considering a home’s unique features (e.g., fence, patio) as well as local amenities and external features like schools, parks, and access to major roads. One interesting finding from this process is that a home’s location in a middle school zone shows a positive relationship with home prices, but its location in an elementary or high school zone does not.
To create our model, we converted our features of interest into variables that can be fed into an OLS regression model. We tested each feature for correlation with home sale prices and fine-tuned our model until we were able to minimize error.
After testing and rejecting several features that did not reduce our prediction errors (e.g., distances to nearest park, major road, and middle school), we ultimately settled on the features (independent variables) listed below.
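As a rough illustration of what fitting an OLS model involves, the sketch below solves the normal equations for a small made-up dataset. It is not the project's code: the data, the two features (square footage and a pool indicator), and the helper names `solve_linear_system` and `fit_ols` are all assumptions chosen for the example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return [intercept, b1, ..., bk] that minimize squared error."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical toy data: (square feet, has pool) and a price built from
# a known rule, so the fitted coefficients can be checked by eye.
rows = [(1000, 0), (1500, 1), (2000, 0), (1200, 1), (1800, 1)]
prices = [50_000 + 100 * sqft + 20_000 * pool for sqft, pool in rows]
coefs = fit_ols(rows, prices)
print(coefs)  # approximately [intercept, per-sqft effect, pool effect]
```

Because the toy prices follow an exact linear rule, the fit recovers that rule up to floating-point error. Real sale data would leave residuals, which is where the error measures reported later come in.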
The map below shows the spatial distribution of home prices in Miami and Miami Beach. Darker points represent more expensive homes, with the deepest purple shade representing any home 1 million dollars or higher. Given the extreme range of home prices in Miami (max approx. 27 million dollars), we felt it necessary to collapse the outlier homes into the highest tier of home prices.
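One way to read "collapsing" outliers into the top tier is top-coding: any price above the threshold is treated as the threshold itself. The sketch below shows that idea on made-up prices. The names `PRICE_CAP` and `top_code` are hypothetical, and the text does not say whether the cap was applied to the data or only to the map legend.

```python
PRICE_CAP = 1_000_000  # top tier threshold taken from the text


def top_code(prices, cap=PRICE_CAP):
    """Replace every price above the cap with the cap itself."""
    return [min(price, cap) for price in prices]


# Hypothetical sale prices, including one extreme outlier.
sale_prices = [12_500, 405_000, 1_000_000, 27_000_000]
capped = top_code(sale_prices)
print(capped)
```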
Here we see the spatial relationship between home sale prices and distance from the shore. Unsurprisingly, as we move farther inland, home prices decrease.
This map shows the relationship between middle school attendance zones and home sale prices.
Below is a map of the percent of White residents in each Census tract in Miami and Miami Beach. As the map shows, although a higher percentage of White residents does not appear to be closely correlated with home price, the absence of White residents is clearly tied to lower home prices.
Several other features such as distance from major roads, parks, and location within elementary and high school attendance zones were tested, but did not prove to be relevant. As shown below, there appears to be little relationship between distance from major roads and home price.
| Statistic | N | Mean | St. Dev. | Min | Max |
|---|---|---|---|---|---|
| SalePrice | 2,066 | 405,476.400 | 199,741.700 | 12,500 | 1,000,000 |
| LotSize | 2,066 | 6,360.875 | 1,721.617 | 1,250 | 17,620 |
| Age | 2,066 | 70.954 | 18.186 | -1 | 115 |
| Stories | 2,066 | 1.073 | 0.265 | 0 | 3 |
| Bed | 2,066 | 2.692 | 0.794 | 0 | 8 |
| Bath | 2,066 | 1.611 | 0.700 | 0 | 6 |
| Pool | 2,066 | 0.108 | 0.310 | 0 | 1 |
| Fence | 2,066 | 0.738 | 0.440 | 0 | 1 |
| Patio | 2,066 | 0.499 | 0.500 | 0 | 1 |
| Shore1 | 2,066 | 7,047.549 | 5,248.614 | 88.597 | 26,528.540 |
| MedRent | 2,040 | 1,042.535 | 311.133 | 246.000 | 2,297.000 |
| pctWhite | 2,062 | 0.703 | 0.320 | 0.057 | 0.989 |
| pctPoverty | 2,062 | 0.217 | 0.108 | 0.052 | 0.556 |
| Brownsville.MS | 1,588 | 0.098 | 0.298 | 0.000 | 1.000 |
| CitrusGrove.MS | 1,588 | 0.115 | 0.319 | 0.000 | 1.000 |
| JosedeDiego.MS | 1,588 | 0.129 | 0.335 | 0.000 | 1.000 |
| GeorgiaJA.MS | 1,588 | 0.133 | 0.340 | 0.000 | 1.000 |
| KinlochPk.MS | 1,588 | 0.196 | 0.397 | 0.000 | 1.000 |
| Madison.MS | 1,588 | 0.001 | 0.035 | 0.000 | 1.000 |
| Nautilus.MS | 1,588 | 0.061 | 0.240 | 0.000 | 1.000 |
| Shenandoah.MS | 1,588 | 0.243 | 0.429 | 0.000 | 1.000 |
| WestMiami.MS | 1,588 | 0.024 | 0.153 | 0.000 | 1.000 |
Below is a correlation matrix, showing the relatedness of each numeric variable to every other. The red-bounded box shows each variable’s correlation with sale price, our dependent variable.
The plots below show the linear relationship between four independent variables and home prices. Actual square footage has the strongest positive correlation with home price. Median rent in a home’s area also has a small positive relationship, while distance from the shore and age are, as expected, negatively correlated with price (i.e., as the age of a home increases, home price decreases).
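The correlations discussed above are Pearson coefficients, which can be computed directly from paired observations. The sketch below uses invented age and price values purely to show the calculation. The function name `pearson_r` and the data are assumptions, not the project's.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: home age in years, sale price in thousands of dollars.
ages = [10, 30, 50, 70, 90]
prices = [500, 450, 400, 300, 250]
r = pearson_r(ages, prices)
print(round(r, 3))  # negative: older homes sell for less in this toy data
```

A correlation matrix is this same calculation repeated for every pair of numeric variables.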
| Dependent variable: SalePrice | (1) | (2) |
|---|---|---|
| Folio | | 0.00000 |
| | | (0.00000) |
| Property.CityMiami Beach | | 220,589.400** |
| | | (102,729.500) |
| LotSize | | 17.974*** |
| | | (1.660) |
| Bed | | 8,653.610* |
| | | (4,483.327) |
| Bath | | 4,613.343 |
| | | (5,440.150) |
| Stories | | 13,854.920 |
| | | (11,214.380) |
| Pool | | 77,281.650*** |
| | | (9,820.892) |
| Fence | | -149.050 |
| | | (5,646.349) |
| Patio | | 4,073.120 |
| | | (5,115.939) |
| ActualSqFt | | 67.033*** |
| | | (6.231) |
| Age | | -698.975*** |
| | | (147.148) |
| Shore1 | -5.745*** | -3.369*** |
| | (1.229) | (1.030) |
| MedHHInc | 1.370*** | 1.091*** |
| | (0.234) | (0.193) |
| TotalPop | 4.453** | 3.400** |
| | (1.797) | (1.481) |
| MedRent | 8.011 | 10.111 |
| | (16.927) | (13.970) |
| pctWhite | 87,738.750*** | 72,584.590*** |
| | (21,213.910) | (18,038.400) |
| pctPoverty | -65,160.580 | -28,329.320 |
| | (46,955.060) | (38,669.420) |
| Brownsville.MS | -74,321.900** | -19,595.140 |
| | (35,665.230) | (29,848.970) |
| CitrusGrove.MS | -63,085.380* | -16,536.970 |
| | (33,757.230) | (27,908.010) |
| JosedeDiego.MS | -24,574.030 | 41,689.690 |
| | (36,087.170) | (30,171.430) |
| GeorgiaJA.MS | -83,659.540** | -21,745.790 |
| | (33,968.440) | (28,569.970) |
| KinlochPk.MS | -20,898.120 | 7,151.547 |
| | (23,871.550) | (19,733.870) |
| Madison.MS | -102,752.000 | -24,901.940 |
| | (89,066.670) | (73,254.900) |
| Nautilus.MS | 229,234.000*** | |
| | (37,774.810) | |
| Shenandoah.MS | 99,482.560*** | 122,292.100*** |
| | (31,097.140) | (25,991.120) |
| WestMiami.MS | | |
| Constant | 274,559.400*** | -5,246.435 |
| | (50,967.110) | (142,697.300) |
| Observations | 1,584 | 1,584 |
| R² | 0.603 | 0.736 |
| Adjusted R² | 0.599 | 0.732 |
| Residual Std. Error | 113,746.600 (df = 1569) | 93,070.580 (df = 1559) |
| F Statistic | 170.158*** (df = 14; 1569) | 180.949*** (df = 24; 1559) |
| Note: | * p<0.1; ** p<0.05; *** p<0.01 | |
In the first regression, we combined our engineered features to see which were statistically significant. The second regression adds the off-the-shelf features to our custom features. The model improves substantially: R² rises from 0.603 to 0.736, and adjusted R² from 0.599 to 0.732.
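The adjusted R² values in the table can be reproduced from R², the number of observations, and the number of predictors, using the standard formula 1 − (1 − R²)(n − 1)/(n − p − 1). In the sketch below, the predictor counts (14 and 24) are read from the F-statistic degrees of freedom in the table. The function name `adjusted_r2` is just a label chosen for this example.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Inputs taken from the regression table (R2, observations, F-statistic df).
adj_model_1 = adjusted_r2(0.603, 1584, 14)
adj_model_2 = adjusted_r2(0.736, 1584, 24)
print(round(adj_model_1, 3), round(adj_model_2, 3))
```

Because the adjusted figure rises along with the raw one, the gain in the second model is not simply an artifact of adding ten more predictors.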
| intercept | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| TRUE | 130147.8 | 0.7933946 | 93060.55 | 354594.9 | 0.2778949 | 210570.7 |
| intercept | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| TRUE | 98912.42 | 0.709759 | 82492.1 | 52113.41 | 0.271611 | 39852.27 |
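The column names in the two tables above (RMSE, MAE, and their `SD` counterparts) suggest resampled error estimates such as k-fold cross-validation, although the text does not name the procedure. The sketch below shows how such figures could be produced, using a one-predictor linear model and invented data. The helpers `fit_line` and `k_fold_errors`, the fold assignment, and the data are all assumptions made for illustration.

```python
import math
import statistics


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


def k_fold_errors(xs, ys, k):
    """Mean and standard deviation of RMSE and MAE across k folds."""
    fold_rmse, fold_mae = [], []
    for fold in range(k):
        # Every k-th observation, starting at `fold`, is held out for testing.
        test_idx = set(range(fold, len(xs), k))
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys))
                 if i not in test_idx]
        test = [(x, y) for i, (x, y) in enumerate(zip(xs, ys))
                if i in test_idx]
        intercept, slope = fit_line([x for x, _ in train],
                                    [y for _, y in train])
        residuals = [y - (intercept + slope * x) for x, y in test]
        fold_rmse.append(math.sqrt(sum(r * r for r in residuals)
                                   / len(residuals)))
        fold_mae.append(sum(abs(r) for r in residuals) / len(residuals))
    return {
        "RMSE": statistics.mean(fold_rmse),
        "MAE": statistics.mean(fold_mae),
        "RMSESD": statistics.stdev(fold_rmse),
        "MAESD": statistics.stdev(fold_mae),
    }


# Hypothetical data: square feet and sale price in thousands of dollars.
sqft = [900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800]
prices = [210, 228, 251, 265, 290, 305, 332, 345, 371, 388]
cv = k_fold_errors(sqft, prices, k=5)
print(cv)
```

A large `RMSESD` relative to `RMSE` indicates that the error varies widely between folds, usually because a few held-out observations are predicted very badly.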